EXECUTIVE SUMMARY


Background 

Running  performance  often  is  used  to  evaluate  aerobic 
capacity.  Run  tests  are  a  useful  alternative  to  laboratory 
measures  of  oxygen  uptake  because  running  performance  is  related 
to  those  measures.  Run  tests  provide  less  precise  estimates  of 
aerobic  capacity  than  laboratory  measurement,  but  are  much  easier 
to  conduct.  The  use  of  run  tests,  therefore,  is  a  tradeoff 
between  precision  of  estimation  and  simplicity  of  administration. 
Run  tests  must  meet  some  minimum  standard  of  estimation  precision 
to  justify  their  use. 

Objective 

This  report  reviews  the  literature  relating  aerobic 
capacity  to  running  performance.  The  goal  was  to  construct  a 
model  to  predict  run  test  estimation  precision  based  on  the 
distance  or  duration  of  the  test.  The  model  could  answer 
questions  such  as  "How  long  must  a  run  be  to  provide  a  valid 
indication  of  aerobic  capacity?"  and  "How  much  will  precision 
increase  if  a  3-mile  run  is  used  instead  of  a  1.5-mile  run?" 

Approach 

The  published  literature  was  searched  to  identify  studies 
of maximal oxygen uptake (VO2max) and running performance. A
meta-analysis was conducted on reported correlations between VO2max
and performance extracted from 122 studies.

Results 

The correlation between VO2max and performance increased with
distance,  but  only  up  to  a  point.  For  fixed-distance  runs,  the 
size  of  the  correlation  increased  up  to  2  km,  then  remained 
constant.  For  fixed-time  runs,  the  correlation  appeared  to  be 
constant  for  runs  of  12  min  or  longer.  Above  these  cutoffs,  the 
fixed-time  correlation  (r  =  .797)  was  slightly  higher  than  the 
fixed-distance  (r  =  .718)  correlation.  These  figures  indicate  a 
standard error between ~3.7 and ~4 ml·kg⁻¹·min⁻¹ compared to a
range of ~2.5 to ~3 ml·kg⁻¹·min⁻¹ for laboratory tests.

Conclusions 

Run  tests  should  be  at  least  2  km  in  distance  or  12  min  in 
duration  to  maximize  validity  as  indicators  of  aerobic  capacity. 
Increasing  distance  or  time  beyond  these  minimum  values  does  not 
improve run test validity as an indicator of VO2max. Other things
equal,  fixed-time  tests  are  preferable  to  fixed-distance  tests. 
These  tests  estimate  aerobic  capacity  with  reasonable  precision. 

Introduction


Maximum oxygen uptake (VO2max) is an important indicator of
cardiorespiratory  function  (McArdle,  Katch,  &  Katch,  1996). 
Laboratory  tests  that  measure  oxygen  uptake  during  treadmill  runs 
or  cycle  ergometry  are  the  gold  standard  for  assessing  this 
capacity.  These  tests  require  special  equipment  and  significant 
time  investments  to  assess  a  single  subject.  The  resource 
intensive  character  of  the  tests  makes  simpler  alternatives 
attractive  in  many  situations. 

Run tests are a popular alternative means of estimating
VO2max. For example, run tests are common in fitness assessments
of school children and military personnel. There is strong
empirical justification for the use of run tests as a less
technology-intensive, cost-effective substitute for laboratory
measures of VO2max. Prior reviews of VO2max and running performance
(Baumgartner & Jackson, 1982; Knapik, 1989; Safrit, Hooper,
Ehlert, Costa, & Patterson, 1988) identify numerous studies that
reported VO2max-running performance correlations that typically
are between r = .50 and r = .80. Safrit et al. (1988) computed an
average of r = .741, after correcting for measurement error, for
runs covering 1 mile or more or lasting 9 minutes or longer.¹

Previous  reviews  clearly  demonstrate  that  run  tests  can  be 
valid  indicators  of  aerobic  capacity,  but  an  important  question 
has  not  been  examined  in  detail:  "Do  longer  runs  provide  better 
estimates  of  aerobic  capacity?"  If  so,  the  longest  distance  that 
the  test  population  can  complete  should  be  used  to  estimate 
aerobic  capacity.  Also,  some  populations  (e.g.,  children, 
elderly)  may  not  be  able  to  complete  a  long  enough  run  to  obtain 
acceptable VO2max estimates. If so, some other method must be used
to  assess  aerobic  capacity.  An  examination  of  the  relationship 
between  run  distance  and  validity  will  answer  important  questions 
such  as  "What  is  the  shortest  run  that  will  meet  a  specified 
validity  criterion  (e.g.,  r  =  .65)?"  and  "How  much  would  be 
gained  if  the  current  test  were  lengthened  by  500  meters?"  The 
answers  to  these  questions  have  important  implications  for  the 
effective  use  of  run  tests,  particularly  in  applied  settings. 

Run test validity as an indicator of aerobic capacity²
should  increase  with  distance.  Aerobic  processes  provide  an 
increasing  proportion  of  the  total  energy  for  performance  as  run 
distance increases (Spencer & Gastin, 2001). Increased dependence
on  aerobic  energy  should  make  the  rate  at  which  aerobic  energy 
can  be  generated  increasingly  important  for  performance. 
Individual  differences  in  that  rate  (i.e.,  differences  in  aerobic 
capacity)  should  be  more  strongly  related  to  performance  for 
longer runs. It follows that VO2max, an indicator of aerobic
capacity,  should  be  more  strongly  related  to  performance  for 
longer  runs. 

Two lines of evidence support the above arguments. First,
studies that assessed performance for several distances have
found stronger correlations for longer runs (Burke, 1976;
Farrell, Wilmore, Coyle, Billing, & Costill, 1979; Shaver, 1975;
Weyand, Cureton, Conley, Sloniger, & Liu, 1994). Second,
mathematical models based on world records predict that a 1%
difference in aerobic capacity yields a 0.3% performance
difference at 400 m, but a 0.997% difference at 10 km
(Ward-Smith, 1999). The close correspondence between aerobic capacity
differences  and  performance  differences  at  longer  distances 
should  translate  into  a  stronger  association  between  the  two  at 
longer  distances. 

The only quantitative review of the VO2max-running
performance literature contradicted the prediction that validity
increases with run distance. Safrit et al. (1988) found no
difference between shorter and longer runs in their analysis.
However, their review only included runs ≥ 1 mile or ≥ 9 minutes.
Most of the increase in the proportion of energy derived from
aerobic processes occurs for shorter runs (Ward-Smith, 1999). The
proportion increases from ~7% at 100 m to ~74% at 1500 m, then to
~97% at 10 km. If validity parallels dependence on aerobic
energy, about 75% of the expected increase in validity
coefficients was not covered in Safrit et al.'s (1988) review.
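
The "about 75%" figure can be checked with a few lines of arithmetic. The sketch below uses only the three approximate percentages quoted above and treats 1500 m as a stand-in for the 1-mile cutoff, which is an assumption of this illustration.

```python
# Approximate aerobic share of total energy, as quoted in the text
# (Ward-Smith, 1999): ~7% at 100 m, ~74% at 1500 m, ~97% at 10 km.
aerobic_share = {100: 0.07, 1500: 0.74, 10_000: 0.97}

# Total rise in aerobic share between the shortest and longest distance.
total_rise = aerobic_share[10_000] - aerobic_share[100]

# Part of that rise that has already occurred by 1500 m
# (used here as a stand-in for the 1-mile cutoff; an assumption).
rise_by_1500 = aerobic_share[1500] - aerobic_share[100]

fraction_by_1500 = rise_by_1500 / total_rise
print(f"Share of the rise occurring by 1500 m: {fraction_by_1500:.1%}")
```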

This review extends the quantitative analysis of run test
validity initiated by Safrit et al. (1988). Meta-analysis (Hedges
& Olkin, 1985; Hunter & Schmidt, 1990) is used to evaluate
quantitative models relating distance/duration to run test
validity. The review covers runs from 10-m sprints to 84.4-km
ultramarathons.


Methods 


Literature  Search 

The  literature  search  was  conducted  in  a  series  of  steps 
designed  to  ensure  broad  coverage  of  published  and  unpublished 
research:

1. Articles cited by Safrit et al. (1988), Baumgartner and
Jackson (1982), and Knapik (1989) formed the initial list of
studies.

2. The Medline, PsychLit, and Discus databases were searched to
identify additional studies using "maximal oxygen uptake"
with "run time" or "running" as the primary keywords.
Additional searches were performed with "maximum oxygen
uptake," "maximal oxygen capacity," "aerobic capacity," and
"VO2max" as variations on maximal oxygen uptake. "Performance"
was used as an alternative to run time.

3. The articles identified in steps 1 and 2 were examined. Those
articles that reported at least 1 relevant correlation were
retained.

4.  An  ancestry  search  (Rosenthal,  1984;  White,  1994)  was 
conducted  by  examining  the  reference  lists  in  the  articles 
retained in step 3.

5.  Year-by-year  searches  were  conducted  in  Journal  of  Sports 
Medicine  and  Physical  Fitness,  Medicine  and  Science  in  Sports 
and  Exercise,  European  Journal  of  Applied  Physiology,  and 
Research  Quarterly  for  Sports  and  Exercise.  Each  journal 
contributed  multiple  articles  in  steps  1  through  4.  All 
volumes  of  the  first  two  journals  were  reviewed;  the  latter 
two  journals  were  reviewed  from  1975  to  present. 

6.  The  Naval  Health  Research  Center  and  San  Diego  State 
University  library  catalogues  were  searched  to  identify 
unpublished studies (e.g., master's theses, doctoral
dissertations).

7.  The  PubMed  database  was  searched  to  identify  any  additional 
publications  appearing  during  the  time  that  references  were 
being collected. The "related articles" option of the program
was  examined  for  each  new  article  found.  This  step  updated  the 
ancestry  search. 

The  search  produced  130  relevant  studies,  but  only  122  were 
used  in  the  analyses.  Six  studies  (Butts,  Henry,  &  McLean,  1991; 
Kohrt, Morgan, Bates & Skinner, 1987; Kohrt, O'Connor & Skinner,
1988;  Krahenbuhl,  Wells,  Brown,  &  Ward,  1979;  Schabort,  Killian, 
St  Clair  Gibson,  Hawley,  &  Noakes,  2000;  Zhou,  Robson,  King,  & 
Davis,  1997)  were  dropped  because  the  run  was  one  of  several 
physically  demanding  activities  performed  in  sequence.  Fatigue 
from  the  other  activities  might  affect  the  validity  coefficients. 
The  study  by  Cureton,  Sloniger,  O'Bannon,  Black,  and  McCormack 
(1995)  was  dropped  because  it  pooled  data  from  several 
investigations.  Other  reports  based  on  parts  of  the  data  included 
more  detail  on  procedures  and  participant  characteristics.  The 
additional  detail  was  useful  for  analyzing  sources  of  variation 
in  run  test  validity.  The  study  by  Doolittle  and  Bigbee  (1968) 
was  dropped  because  it  reported  a  rank-order  correlation  rather 
than a Pearson product-moment correlation.

The remaining 122 studies reported results for 156 distinct
samples. Because participants in some studies ran more than one
distance, a total of 273 correlations were available based on
VO2max data from 6,140 individuals paired with 10,173 run
performances.

Data  Extraction 

The  information  extracted  from  each  report  consisted  of  the 
sample size, the type of run test (fixed-distance or fixed-time),
the distance run, the average run time, and the VO2max-running
performance correlation. Performance was recorded a number of
different ways in different studies. Performance on
fixed-distance tests was usually recorded as a run time, but sometimes
was represented by average running velocity. Performance on
fixed-time tests typically was recorded as distance, but
sometimes was reported as a predicted VO2max. VO2max predictions
usually were computed using equations that involved only run
distance. However, in some cases the predictions were based on
multivariate equations with other predictors such as weight or
gender.

Two  steps  were  taken  to  make  sure  that  correlations  were 
comparable  across  studies.  All  of  the  correlations  that  used  run 
time  as  the  performance  criterion  were  reversed.  For  each  of  the 
other  criteria,  higher  values  indicated  better  performance.  The 
correlations,  therefore,  were  nearly  all  positive.  When  run  time 
was  the  criterion,  lower  scores  indicated  better  performance  and 
nearly  all  correlations  were  negative.  Reversing  the  sign  for 
these  correlations  meant  that  the  results  from  all  studies  were 
expressed using coefficients that indicated how strongly VO2max
was  related  to  good  performance. 

The second step taken to ensure that correlations were
comparable restricted the set of results for the estimated VO2max
criterion. Correlations between measured and estimated VO2max were
included in the review only if the prediction equation was a
linear function of distance with no other predictors. When these
conditions are met, the prediction is merely a linear
transformation of distance. Linear transformations of variables
produce correlations that are identical to those for the variable
itself (Hays, 1963), so the correlation between VO2max and
predicted VO2max would be identical to the correlation between
VO2max and distance. This identity did not apply in studies where
other predictors (e.g., weight, gender) or higher powers of
distance (e.g., distance squared) were included in the predictive
equation.
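
The invariance argument can be illustrated numerically. The following is a minimal sketch with made-up data; the VO2max values, distances, and prediction coefficients are hypothetical and are not taken from any reviewed study. It assumes a positive slope, since a negative slope would reverse the sign of the correlation.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measured VO2max and fixed-time run distances (metres).
vo2max = [42.0, 48.5, 51.0, 55.2, 60.1, 46.3]
distance_m = [2300, 2600, 2750, 2900, 3200, 2500]

# Hypothetical linear prediction equation (illustrative coefficients).
predicted_linear = [0.02 * d - 10.0 for d in distance_m]
# A nonlinear predictor (distance squared) for contrast.
predicted_squared = [d ** 2 for d in distance_m]

r_distance = pearson_r(vo2max, distance_m)
r_linear = pearson_r(vo2max, predicted_linear)
r_squared_term = pearson_r(vo2max, predicted_squared)

print(r_distance, r_linear, r_squared_term)
```

The first two correlations agree to floating-point precision, while the squared-distance predictor gives a different value, which is why such equations were excluded.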

A  separate  record  was  constructed  for  each  run  test  in  a 
study. Thus, if a study included 1500-m, 5-km, and 10-km runs, a
separate  record  was  constructed  for  each  distance.  Sample 
attributes  were  duplicated  on  each  record.  Each  record  was 
treated  as  a  separate  case  in  the  analysis.  This  decision  meant 
that  the  cases  analyzed  were  not  entirely  independent,  thereby 
introducing  statistical  complexities  for  significance  testing 
(Becker  &  Schram,  1994).  The  common  meta-analytic  practice  of 
averaging  effect  sizes  to  produce  a  single  value  for  each  sample 
was  not  suitable  for  the  present  purposes.  Averaging  would  have 
prevented  meaningful  analysis  of  the  relationship  between 
validity  and  test  length. 

Data Analysis


Rosenthal and DiMatteo (2001) capture the intended spirit
of the present data analysis with two observations: "Meta-analysis
is not inherently different from primary data analysis;
it requires the same basic tools, thought processes, and
cautions" (Rosenthal & DiMatteo, 2001, p. 78). "The best quality
scientific exploration is often one that poses unadorned,
straightforward questions and uses simple statistical techniques
for analysis" (Rosenthal & DiMatteo, 2001, p. 68). A
meta-analysis can appear complex because it involves a number of
decision points (Wanous, Sullivan, & Malinak, 1989) and because
effect sizes are analyzed rather than raw data. However, the
essential computational procedures are analogous to familiar
procedures for computing descriptive statistics, analysis of
variance (ANOVA), and regression. The central components of the
procedures in this paper were:

A. Olkin and Pratt's (1958) correction for sample bias in the
estimated correlations was applied. Hedges and Olkin (1985)
note that this correction is most important when 0.4 ≤ r ≤
0.6 and sample size is small (e.g., n < 15). The average
correlation reported by Safrit et al. (1988) was just above
the upper end of this range, and many of the correlations
reviewed (66 of 273, 24.2%) were from samples with n ≤ 15.
These figures suggested that the unbiased correlations
should be used to protect against underestimating the true
population correlation.

B. Fisher's r-to-z transformation was applied to normalize the
distribution of correlations. The data points analyzed and
predicted, therefore, are labeled zUF(i) as a reminder that
they are unbiased, Fisher-transformed estimates of the
population correlations for a given sample, denoted by the
"i" in the subscript.

C. Each reported correlation was compared to a predicted value
(i.e., zUF(i) - zUF'). The predicted values were familiar
elements of standard analysis procedures. For example, the
predicted values in one analysis of variance model were the
means for all tests of specific distances (e.g., 800 m,
1500 m). The predicted values in another analysis were
determined from the regression of zUF(i) on the logarithm of
distance.

D. The difference between the observed and predicted values
was standardized. This was accomplished by dividing zUF(i) -
zUF' by the standard deviation for the transformed
correlation (i.e., 1/√(Ni - 3)).

E. The standardized value for the difference was squared to
produce a χ² with 1 degree of freedom (Hays, 1963).

F. The χ² values for all correlations in the analysis were
summed to produce an overall χ² that was the summary fit
statistic for the model.

G. The χ² values for competing models were compared to
determine which model best accounted for the observed
variation in the correlations.

This  summary  shows  that  the  computations  involve  differences 
between  observed  and  predicted  values.  The  differences  are 
directly  comparable  to  the  deviations  and/or  residuals  computed 
for  descriptive  statistics,  ANOVA,  or  regression  analyses  of  raw 
data.  The  statistical  comparisons  between  models  are  comparable 
to  using  incremental  variance  explained  to  select  a  model  in 
primary data analyses.
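
Steps A through G can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the report's SPSS procedure: the bias correction uses a commonly cited approximation to the Olkin-Pratt estimator (the report does not say which form was applied), the sample data are hypothetical, and the two competing models are a single grand mean versus separate means for runs shorter than 1500 m and runs of 1500 m or more.

```python
import math

def unbiased_r(r, n):
    """Approximate Olkin-Pratt correction (assumed form, see lead-in)."""
    return r * (1.0 + (1.0 - r * r) / (2.0 * (n - 3)))

def fisher_z(r):
    """Fisher r-to-z transformation."""
    return math.atanh(r)

def weighted_mean_z(samples):
    """Mean of the corrected, transformed correlations, weighted by n - 3."""
    num = sum((n - 3) * fisher_z(unbiased_r(r, n)) for r, n, _ in samples)
    den = sum(n - 3 for _, n, _ in samples)
    return num / den

def chi_square_fit(samples, predict):
    """Sum of squared standardized residuals (steps C through F)."""
    total = 0.0
    for r, n, dist in samples:
        z_obs = fisher_z(unbiased_r(r, n))          # steps A and B
        residual = z_obs - predict(dist)            # step C
        standardized = residual * math.sqrt(n - 3)  # step D: divide by 1/sqrt(n - 3)
        total += standardized ** 2                  # steps E and F
    return total

# Hypothetical samples: (reported r, sample size, run distance in metres).
samples = [(0.30, 25, 200), (0.45, 20, 400), (0.62, 35, 1500),
           (0.74, 50, 3200), (0.70, 28, 5000)]

grand = weighted_mean_z(samples)
short = weighted_mean_z([s for s in samples if s[2] < 1500])
long_ = weighted_mean_z([s for s in samples if s[2] >= 1500])

# Step G: compare the residual chi-square of the two models.
chi_null = chi_square_fit(samples, lambda d: grand)
chi_groups = chi_square_fit(samples, lambda d: short if d < 1500 else long_)
print(chi_null, chi_groups, chi_null - chi_groups)
```

The drop in residual χ² from the grand-mean model to the two-group model is the variation accounted for by the grouping.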

Meta-analysts must choose between fixed-effects and random-effects
models (Hedges & Olkin, 1985; Hedges & Vevea, 1998;
Raudenbush,  1994).  Fixed-effects  models  were  the  starting  point 
for  the  analyses,  but  a  random-effects  model  was  the  end  point. 
Fixed-effects  models  have  smaller  error  variances  than  random- 
effects  models  (Becker  &  Schram,  1994;  Erez,  Bloom  &  Wells,  1996; 
Hedges  &  Vevea,  1998).  Smaller  error  variance  means  larger 
standardized  differences  for  fixed-effects  analyses  than  for 
random-effects analyses. The overall model χ² is the sum of the
squared standardized values (Hays, 1963), so underestimating
error variance increases χ². This fact makes fixed-effects
analyses  lenient  relative  to  random-effects  models.  However, 
fixed-effects  models  are  a  necessary  first  step  in  the  iterative 
computation  of  the  random-effects  variance  estimate  in  any  case. 
Hedges and Vevea's (1998) procedures were used to compute a
random-effects  model  after  using  fixed-effects  analyses  to  choose 
between  models.  This  decision  made  it  possible  to  compare  the 
models  directly  because  each  model  was  being  used  to  account  for 
the same χ². Hedges and Vevea's (1998) Equation 10 was used to
compute  the  random-effects  component  of  variance  following  the 
initial  fixed-effects  analysis. 
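
The two-stage logic (fixed-effects fit first, then a between-study variance component) can be sketched as follows. The estimator shown is the usual method-of-moments form for an intercept-only model; whether it matches Hedges and Vevea's Equation 10 term for term is an assumption here, and the sample sizes and residual χ² are hypothetical.

```python
def random_effects_variance(q_residual, df_residual, weights):
    """Method-of-moments between-study variance, truncated at zero.

    Assumed form: (Q - df) / c, with c = sum(w) - sum(w^2) / sum(w),
    which applies to a model with a single common mean.
    """
    sum_w = sum(weights)
    c = sum_w - sum(w * w for w in weights) / sum_w
    return max(0.0, (q_residual - df_residual) / c)

def random_effects_weights(sample_sizes, tau2):
    """Inverse of (sampling variance of Fisher z + between-study variance)."""
    return [1.0 / (1.0 / (n - 3) + tau2) for n in sample_sizes]

# Hypothetical inputs: five samples and a fixed-effects residual chi-square.
sample_sizes = [20, 35, 50, 28, 25]
fixed_weights = [n - 3 for n in sample_sizes]
tau2 = random_effects_variance(q_residual=12.0, df_residual=4,
                               weights=fixed_weights)
re_weights = random_effects_weights(sample_sizes, tau2)
print(tau2, re_weights)
```

Because the added variance component shrinks every weight, standardized differences are smaller under the random-effects model, which is the sense in which the fixed-effects analysis is the more lenient of the two.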

Analyses  were  conducted  with  the  general  linear  model  (GLM) 
and linear regression procedures in SPSS-PC (SPSS, Inc.,
1998a, b). The weighted least squares option in each procedure was
used to apply the (n - 3) weight. Using this weighting option,
the sums of squares reported in the analysis results are χ²
values equal to Hedges' Q (cf. Hedges & Olkin, 1985, pp. 235-241).
The GLM procedure was used for analyses involving discrete
groups  (e.g.,  males  and  females)  and  for  multivariate  models. 
Linear  regression  was  used  for  analyses  of  nominally  continuous 
variables  (e.g.,  age).  Nominally  continuous  variables  were 
covariates  in  the  multivariate  models. 
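
The link between weighted sums of squares and χ² values can be sketched without SPSS. The example below runs a weighted least squares regression of Fisher z on the logarithm of distance with (n - 3) weights and splits the weighted total sum of squares into model and residual parts. The data are hypothetical, base-10 logarithms are an assumption, and the bias correction is omitted for brevity.

```python
import math

def weighted_regression(z, x, w):
    """Weighted least squares of z on x.

    Returns slope, intercept, and the model/residual split of the
    weighted sum of squares around the weighted mean of z.
    """
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mz = sum(wi * zi for wi, zi in zip(w, z)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxz = sum(wi * (xi - mx) * (zi - mz) for wi, xi, zi in zip(w, x, z))
    slope = sxz / sxx
    intercept = mz - slope * mx
    q_total = sum(wi * (zi - mz) ** 2 for wi, zi in zip(w, z))
    q_model = slope * sxz
    return slope, intercept, q_model, q_total - q_model

# Hypothetical samples.
sizes = [20, 35, 50, 28, 25]
rs = [0.45, 0.62, 0.74, 0.70, 0.30]
distances_m = [400, 1500, 3200, 5000, 200]

z = [math.atanh(r) for r in rs]
x = [math.log10(d) for d in distances_m]  # log base is an assumption
w = [n - 3 for n in sizes]

slope, intercept, q_model, q_error = weighted_regression(z, x, w)
print(slope, intercept, q_model, q_error)
```

With these weights, `q_model` plays the role of the χ² accounted for by the regression and `q_error` the residual χ².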

Model  Comparison  and  Selection 

Statistical  significance  tests  are  an  imperfect  guide  to 
model  selection  (Morrison  &  Henkel,  1970;  Harlow,  Mulaik,  & 
Steiger,  1997).  Even  very  small  effects  are  statistically 
significant when examined in large samples (Rosenthal & Rosnow,
1984). Including weak effects in a model increases parametric
complexity with little gain in predictive accuracy. Thus, the
question is whether the increase in explanatory power justifies
the increased complexity of the model. Identification of a
parsimonious model, therefore, involves a tradeoff between
explanatory power and complexity (Popper, 1959; Mulaik et al.,
1989).

Two  steps  were  taken  to  foster  parsimony.  Hoelter's  (1983) 
critical  N,  the  smallest  sample  size  for  which  an  observed 
difference  would  be  statistically  significant,  was  applied.  If 
critical  N  is  large,  the  effect  arguably  is  too  small  to  be 
important.  Hoelter's  (1983)  rule  of  thumb  that  critical  N  should 
be  >  200  was  adopted  to  identify  effect  sizes  too  small  to  be 
practically  or  theoretically  important. 

The  second  protection  against  unnecessarily  complex  models 
was based on goodness of fit statistics for the model (cf.,
Arbuckle & Wothke, 1999; Bentler & Bonett, 1980; Bollen, 1989,
for discussions of goodness of fit). The Tucker-Lewis index (TLI;
Tucker & Lewis, 1973) was adopted as a goodness-of-fit indicator:

TLI = (χ²N/dfN - χ²M/dfM) / (χ²N/dfN - 1)

where N indicates the null model and M indicates the alternative
model. The expected value of χ²/df is 1.00 when chance is the only
source of variation, so TLI was the proportion of the greater
than chance variation in the observed correlations accounted for
by a model. James, Mulaik, and Brett's (1982) parsimony
adjustment then was applied:

PTLI  =  TLI*  (dfM/dfN) 

Basically,  PTLI  increases  when  the  proportional  gain  in 
explanatory  power  exceeds  the  proportional  decrease  in  degrees  of 
freedom.  Mulaik  et  al.  (1989)  explain  the  rationale  for  this 
adjustment  in  detail. 
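
A short sketch shows how the two indices behave with the figures reported in Table 1 of this report. It rests on one assumption that the text does not state outright: the model χ² in the TLI formula is taken to be the residual χ² (the overall χ² of 911.725 minus the tabled value accounted for by the model), with the residual degrees of freedom (225 minus the tabled model df). Under that reading the tabled TLI and PTLI values are reproduced to within rounding.

```python
def tli(chi2_null, df_null, chi2_model, df_model):
    """Tucker-Lewis index from null-model and alternative-model chi-squares."""
    null_ratio = chi2_null / df_null
    return (null_ratio - chi2_model / df_model) / (null_ratio - 1.0)

def ptli(tli_value, df_model, df_null):
    """Parsimony-adjusted TLI."""
    return tli_value * df_model / df_null

# Overall chi-square and df from the note to Table 1.
CHI2_TOTAL, DF_TOTAL = 911.725, 225

def fit_indices(chi2_explained, df_used):
    """TLI and PTLI, treating the tabled chi-square as variation explained
    (an assumption; see the lead-in)."""
    chi2_resid = CHI2_TOTAL - chi2_explained
    df_resid = DF_TOTAL - df_used
    t = tli(CHI2_TOTAL, DF_TOTAL, chi2_resid, df_resid)
    return t, ptli(t, df_resid, DF_TOTAL)

tli_sml, ptli_sml = fit_indices(229.561, 2)    # S/M/L model
tli_txt, ptli_txt = fit_indices(355.827, 23)   # TxT model
print(round(tli_sml, 3), round(ptli_sml, 3))
print(round(tli_txt, 3), round(ptli_txt, 3))
```

The TxT model spends 23 degrees of freedom to gain its higher TLI, so the parsimony adjustment narrows, but does not erase, its advantage over the S/M/L model.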


Results 

Fixed-distance and fixed-time tests were considered
separately. This approach avoided confounding cases in which
distance or time was an experimental design variable defining the
run test with cases where the same variables were performance
indices.

Fixed-distance  Tests 

General Pattern. The LOESS curve (cf., Cleveland, 1979) in
Figure  1  (see  p.  7)  shows  the  basic  pattern  of  data  relating 
validity  to  run  distance. 


[Figure 1: LOESS curve; x-axis: Logarithm of Distance]

Figure 1

Validity Coefficients as a Function of Distance

Group Classification Models. Fixed-distance tests were
grouped several ways to generate group-based models. These models
shared the common characteristic that the grouping procedure
would explain the observed variation in validity coefficients
only if the runs within a group shared a common value.
Fixed-distance tests were classified as short- (<1500 m), middle-
(1500-1850 m), or long-distance (≥2000 m) runs. One model (S/M/L)
treated each category separately. Two other models (SM/L and
S/ML) explored the effect of treating middle-distance runs as
short runs or long runs, respectively.

The  most  extensive  group  model  consisted  of  24  groups. 
Twenty-two  (22)  groups  were  specific  run  distances  (e.g.,  400  m, 
800 m, 5 km). A separate group was included for any run that had
been  studied  in  3  or  more  samples.  The  other  two  groups  in  this 
model  consisted  of  all  short  (<1500  m,  n  =  9)  and  long  (>1600  m, 
n  =  9)  runs  that  had  been  studied  in  only  1  or  2  samples.  This 
model was labeled the "test-by-test" (TxT) model to emphasize
that individual run tests were treated separately when there was
enough data to provide a reasonably stable estimate of the
VO2max-running performance correlation.

Table 1. Comparison of Group-based Models

Model    df    χ²         TLI     PTLI
S/ML      1    203.310    .292    .291
SM/L      1    189.051    .271    .270
S/M/L     2    229.561    .325    .322
TxT      23    355.827    .425    .382

Note. S = Short, M = Medium, and L = Long. See text for group
definitions. "df" is "degrees of freedom." "TLI" and "PTLI" are
the Tucker-Lewis index and the parsimony-adjusted Tucker-Lewis
index, respectively. The tabled χ² values indicate the variation in
correlations accounted for by the model. The overall χ² was
911.725 with 225 df.


The TxT model clearly was the best group alternative (Table
1). This model was a significant improvement on the next best
alternative, the S/M/L model (Δχ² = 126.27, 21 df, p < .001).
Even allowing for differences in parsimony, the goodness of fit
of the TxT model (PTLI = .382) was better than the S/M/L model
fit (PTLI = .322). The S/M/L model was significantly better than
either dichotomous model (Δχ² ≥ 26.25, 1 df, p < .001).
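
The nested-model comparison can be reproduced from the tabled values. In the sketch below the χ² difference and its degrees of freedom come from Table 1; the p-value uses the Wilson-Hilferty normal approximation to the χ² distribution, which is a convenience chosen for this illustration so that only the standard library is needed, not the method used in the report.

```python
import math

def chi2_upper_tail(chi2, df):
    """Approximate P(X >= chi2) for a chi-square variable with df degrees
    of freedom, via the Wilson-Hilferty cube-root normal approximation."""
    t = 2.0 / (9.0 * df)
    z = ((chi2 / df) ** (1.0 / 3.0) - (1.0 - t)) / math.sqrt(t)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# TxT versus S/M/L, using the chi-square and df values in Table 1.
delta_chi2 = 355.827 - 229.561
delta_df = 23 - 2
p_value = chi2_upper_tail(delta_chi2, delta_df)

print(round(delta_chi2, 2), delta_df, p_value < 0.001)
```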

Models with Distance as a Continuous Variable. A second set
of  models  used  distance  as  a  continuous  variable.  These  models 
included  simple  regression  and  analysis  of  covariance  (ANCOVA) 
models.  The  ANCOVA  models  tested  the  hypothesis  that  variations 
in the size of the correlations within the 2- and 3-group models
could  be  accounted  for  by  distance.  If  so,  it  would  be 
unreasonable  to  treat  the  tests  within  a  group  as  equivalent. 
Preliminary  analyses  showed  that  a  logarithmic  transformation  of 
distance  increased  the  predictive  power  of  the  analyses,  so  this 
transformation  was  used  in  constructing  these  models. 

The analyses led to a mixed model that regressed validity
on distance for shorter runs, but treated longer runs as a single
group with a common validity (Table 2). A significant amount of
variation in the validity coefficients could be accounted for by
regressing zUF(i) on distance (LogDist model; χ² = 226.80, PTLI =
.324). However, both ANCOVA models improved on this basic
regression model (Δχ² ≥ 17.25, 2 df, p < .001). The SM/L model
was the better alternative between the two ANCOVA models (SM/L
PTLI = .359; S/ML PTLI = .338).

The final mixed model was developed because the regression
lines were not parallel for the two SM/L groups (χ² = 15.65, 1
df, p < .001; cf. Walker & Lev, 1953, pp. 390-393, for the
statistical test). The logarithm of distance predicted zUF in the
SM group (χ² = 69.724, 1 df, p < .001), but not the L group (χ² =
0.08, 1 df, p > .777).

Table 2. Models with Distance as a Continuous Variable

Model          df    χ²         TLI     PTLI
LogDist         1    226.800    .326    .324
S/ML ANCOVA     3    244.05     .342    .338
SM/L ANCOVA     3    258.85     .364    .359
PW              2    258.77     .368    .363

Note. See text for description of models. The within and between
values for the PW model indicate the contribution of each model
element on total χ² for the SM/L ANCOVA.


The  mixed  model  then  was  constructed  based  on  the  ANCOVA 
results  and  Figure  1.  The  model  was: 

If distance < 2000 m, zUF' = (.225*LogD) - .0036
If distance ≥ 2000 m, zUF' = .9026

where  "LogD"  was  the  logarithm  of  test  distance.  This  mixed  model 
was  labeled  the  "piecewise"  (PW)  model  because  it  had  distinct 
prediction  components  for  different  ranges  of  distance.  The  PW 
model fit the data almost as well as the full SM/L ANCOVA (χ² =
258.77 versus χ² = 258.85). The PW PTLI was higher (.363 versus
.359).³
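
The piecewise model is easy to express as a function. The sketch below assumes that "logarithm" means the base-10 logarithm of distance in metres, which the text does not state; the function names are illustrative. Predicted values are on the Fisher z scale, so they are converted back to correlations with the inverse transformation.

```python
import math

def pw_predicted_z(distance_m):
    """Piecewise (PW) prediction on the Fisher z scale.

    Assumes LogD is the base-10 logarithm of distance in metres.
    """
    if distance_m < 2000:
        return 0.225 * math.log10(distance_m) - 0.0036
    return 0.9026

def z_to_r(z):
    """Inverse Fisher transformation."""
    return math.tanh(z)

for d in (400, 1000, 1500, 2000, 5000):
    z = pw_predicted_z(d)
    print(d, round(z, 4), round(z_to_r(z), 3))
```

For runs of 2000 m or more the constant .9026 back-transforms to a correlation of about .718, which agrees with the fixed-distance value cited in the Executive Summary.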

Comparing  the  Best  Models 

The next analysis compared the TxT and PW models as the
best alternatives within the two general categories of model. The
TxT model fit the data better (Δχ² > 98.06, 21 df, p < .001), but
much of the difference was attributable to the greater parametric
complexity of the TxT model. The PTLI values were similar (TxT
PTLI = .382; PW PTLI = .363). The sampling variability of PTLI is
not known and the specific method of quantifying the parsimony
adjustment is only a rule of thumb. Under these conditions, a
PTLI difference of .019 was close enough to compare the models
further.

Figure  2  compares  the  model  predictions  for  the  22  run 
distances  that  had  been  studied  in  3  or  more  samples.  Differences 
in  the  predictions  from  the  two  models  generally  were  small. 
Figure 3 illustrates this fact by expressing the differences as
χ² values. Because the TxT prediction minimizes the weighted squared
error for each run distance, Figure 3 also illustrates the loss
in predictive accuracy by replacing the TxT model with the PW
model.

[Figure 2: Fisher-transformed correlation (y-axis) by distance
(x-axis) for the Piecewise and Test x Test models]

Note. Distance is not to scale. Test x Test groups equally spaced.

Figure 2

Piecewise and Test x Test Predictions

The effect of a given run test on the overall χ² difference
between the TxT and PW models depends on the size of the
difference and the sample size for the test (Rosenthal & Rosnow,
1984). Figure 3 plots the differences between predictions after
translating each into a z-score, then squaring that score. These
computations express each difference as a χ² with 1 degree of
freedom (Hays, 1963). Stavig and Acock's (1976) procedure was
used to determine which χ² values were greater than expected by chance.

Only  14%  (3  of  22)  of  the  differences  were  greater  than 
expected  by  chance.  One  significant  difference  was  of  limited 
practical  importance  because  it  was  the  product  of  a  small  effect 
size  combined  with  a  large  (N  =  650)  sample  size.  Substantial 
differences  between  the  models  were  limited  to  2  of  22  run 
distances.  The  data  for  these  2  run  distances  consisted  of  7 
correlations  involving  225  performance  scores.  For  practical 
purposes,  the  two  models  provided  effectively  equivalent 
predictions for ~97% of the data reviewed (all but 3.4% of 208
correlations and 2.7% of 8,505 performance scores). Notice also
that both runs for which the PW model showed large significant
errors in prediction were < 1 km in length. Thus, neither
substantial error was for a run that would be classified as an
endurance test in the PW model.

Within-Study  Evaluation 

The stability of the VO2max-running performance correlation
from  2  km  on  was  surprising  in  light  of  bioenergetic  models. 
However,  the  proportion  of  energy  derived  from  aerobic  processes 
increases relatively slowly for longer runs (Capelli, 1999; di
Prampero et al., 1993; Ward-Smith, 1999). The underlying logic
of  using  bioenergetic  models  to  predict  validity  trends, 
therefore,  implies  that  validity  will  increase  slowly  for  longer 
runs.  If  the  true  validity  differences  are  small,  sampling 
variation  and  methodological  differences  between  studies  could 
mask  the  upward  trend. 

Within-study  analyses  were  conducted  to  increase  the 
sensitivity  of  the  analyses.  Those  samples  in  the  data  set  that 
performed two or more runs were identified. The VO2max-performance
correlations  were  compared  for  all  pairwise  combinations  of  tests 
in  each  sample.  Because  the  people  and  methods  are  the  same  for 
each  correlation  in  a  pair,  sampling  effects  and  methods  variance 
are  constant.  If  there  is  no  effect  of  distance,  the  comparisons 
should  show  that  the  longer  run  produced  the  larger  correlation 
50%  of  the  time.  If  the  bioenergetic  predictions  are  correct,  the 
longer  run  should  produce  the  larger  correlation  more  than  50%  of 
the  time. 

The  within-study  comparisons  were  consistent  with  Figure  1 
and  the  PW  model.  The  correlation  for  the  longer  run  was  larger 
in  86%  (134  of  156)  of  the  pairwise  comparisons  when  at  least  one 
run  was  <  2  km.  The  longer  run  produced  the  larger  correlation 
only  54%  (22  of  41)  of  the  time  when  both  runs  were  >  2  km.  The 
frequency of a larger correlation for the longer run was greater
than chance for short runs (z = 8.97, p < .001) but not long runs
(z = 0.47, p > .319).
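
The reported z values are consistent with a normal approximation to a sign test, in which the number of comparisons won by the longer run is set against the 50% expected when distance has no effect. The sketch below works under that assumption; the report does not name the formula used, and the function names are illustrative only.

```python
import math

def sign_test_z(wins, total, p_null=0.5):
    """Normal-approximation z for `wins` successes out of `total` comparisons."""
    expected = total * p_null
    sd = math.sqrt(total * p_null * (1.0 - p_null))
    return (wins - expected) / sd

def upper_tail_p(z):
    """One-tailed p value from the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Counts taken from the within-study comparisons reported above.
short_z = sign_test_z(134, 156)  # at least one run shorter than 2 km
long_z = sign_test_z(22, 41)     # both runs longer than 2 km

print(round(short_z, 2), round(long_z, 2))
print(round(upper_tail_p(long_z), 3))
```

Under this reading, the rounded z values match the two figures given in the text, and the one-tailed p for the long-run comparison lands close to the reported bound.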

Random Effects Model. The preceding analyses favored the PW
model. Therefore, a random-effects version of that model was
computed using Hedges and Vevea's (1998) procedures:

If distance < 2000 m, zUF' = (.259*Log10D) - .108
If distance ≥ 2000 m, zUF' = .9518
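
As a minimal sketch, the two-part prediction rule can be written as a single function. Two assumptions are made here that the section does not state outright: the logarithm is taken to base 10, and the predicted zUF' is converted back to a correlation with a plain inverse Fisher transform (tanh), ignoring any small-sample adjustment built into zUF. The function and constant names are illustrative.

```python
import math

BREAK_M = 2000.0  # transition distance in the PW model, in meters

def pw_fisher_z(distance_m):
    """Predicted Fisher-transformed validity from the random-effects PW model."""
    if distance_m < BREAK_M:
        return 0.259 * math.log10(distance_m) - 0.108
    return 0.9518

def z_to_r(z):
    """Inverse Fisher transform (assumed here to be a plain tanh)."""
    return math.tanh(z)

for label, meters in [("1 mile", 1609.344), ("2 km", 2000.0), ("5 km", 5000.0)]:
    print(label, round(z_to_r(pw_fisher_z(meters)), 3))
```

Under these assumptions the 1-mile prediction back-transforms to roughly r = .62 and every distance of 2 km or more to roughly r = .74.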

The random-effects model produced smaller χ² values than the
fixed-effects model. This trend was expected given the larger
variance used to standardize differences. The shift to a
random-effects model did not change the inferences about the model
components. The regression of zUF on the logarithm of distance
still was significant for the shorter runs (χ² = 27.423, 1 df, p
< .001), but not the longer runs (χ² = 0.258, 1 df, p > .611).

The difference between the average value for the shorter and
longer runs remained significant (χ² = 69.776, 1 df, p < .001).
The overall model, therefore, was significant (χ² = 97.199, 2 df,
p < .001).

Fixed-time Tests

A total of 47 fixed-time tests were included in the review.
This set included 4 5-min tests, 4 6-min tests, 3 9-min tests, 1
10-min test, 30 12-min tests, and 5 15-min tests. The average
validity for the 47 fixed-time tests was r = .752.

Test time, the fixed-time equivalent of test distance for
fixed-distance runs, was positively related to correlation
magnitude (χ² = 61.710, 5 df, p < .001). The average correlations
suggested three sets of comparable tests: Set A = {6-min, r =
.485}; Set B = {5-min, r = .659; 9-min, r = .645; 10-min, r = .629};
Set C = {12-min, r = .791; 15-min, r = .835}.

A trend toward higher correlations for longer tests was
evident. The trend was most evident as a contrast between tests
≥12 min and tests ≤10 min. Based on this observation, a model
that treated each of the 5 tests as separate groups was compared
to two alternatives:

A. Regression: The linear regression of zUF on time was
   significant (r = .473, χ² = 51.586, y' = .00137*Seconds +
   .07905).

B. Dichotomous: Short (≤ 10 min; k = 12) tests were compared to
   long (≥ 12 min; k = 35) tests. Differences among short
   tests were nonsignificant (χ² = 6.30, 3 df, p > .097).
   Differences among long tests were nonsignificant (χ² =
   2.173, 1 df, p > .140). The difference between short and
   long tests was highly significant (χ² = 53.232, 1 df, p <
   .001).

The 5-group model predicted better than the time regression
model (χ² = 10.124, 4 df, p < .039), but did not improve
significantly on the dichotomous model (χ² = 8.478, 4 df, p >
.075). Goodness of fit favored the dichotomous model (PTLI =
.258) over the regression model (PTLI = .249) and the 5-group
model (PTLI = .198).
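
The two model comparisons above appear to be χ² difference tests. Subtracting each simpler model's χ² from the 61.710 (5 df) reported for test time reproduces the 10.124 and 8.478 given in the text, each on 4 df. The sketch below rests on that inference, which the report does not state explicitly. The helper uses the closed-form upper tail of the χ² distribution that holds for even df only; all names are illustrative.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate; valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

full_chi2, full_df = 61.710, 5   # separate-groups model for test time
regression_chi2 = 51.586         # linear regression on time, 1 df
dichotomous_chi2 = 53.232        # short versus long tests, 1 df
reduced_df = 1

diff_df = full_df - reduced_df   # 4 for both comparisons
diff_regression = full_chi2 - regression_chi2
diff_dichotomous = full_chi2 - dichotomous_chi2

p_regression = chi2_sf_even_df(diff_regression, diff_df)
p_dichotomous = chi2_sf_even_df(diff_dichotomous, diff_df)
print(round(diff_regression, 3), round(p_regression, 4))
print(round(diff_dichotomous, 3), round(p_dichotomous, 4))
```

On this reading, the first comparison falls just under the .05 level and the second just above it, in line with the p bounds reported in the text.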

Boundary  Case  Analysis  for  an  Endurance  Criterion 

The  PW  model  and  the  analyses  of  fixed-time  tests  suggested 
an  empirical  definition  of  the  term  "endurance  test."  Setting  the 
criterion  of  >2  km  or  >12  minutes  provided  a  reasonable  working 
definition  of  an  endurance  test.  The  definition  treats  all  tests 
above  the  distance/time  cutoff  as  equally  valid.  All  tests  below 
the  cutoff  have  lower  validity. 

The  appropriateness  of  the  proposed  boundary  criteria  was 
evaluated.  Two  predictions  were  made  for  4  boundary  cases,  the 
1500-meter,  1-mile,  2-kilometer,  and  1.5-mile  runs.  One 
prediction was based on the validity-distance regression for runs
<1500 m (zUF' = .195*Log10D + .0380). The second prediction was
the average correlation for tests >1.5 miles (zUF' = .8678).
Hoelter's (1983) critical N was used to evaluate the differences
(zUF - zUF'). Disch, Frankiewicz and Jackson's (1975) names for
their  two  running  performance  factors  were  adopted  to  label  the 
two  predictions  "speed"  and  "distance,"  respectively. 

Table 3. Goodness of Fit for Boundary Tests

                         Critical N if Classified as:
Test      Average zUF    Speed      Distance
1500m       .7442         512         255
1609m       .7317         825         211
2000m      1.0167          38         176
2414m      1.0754          30          93

Note. See text for definition of speed and endurance tests.
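
The critical N values in Table 3 can be approximated with a short calculation. The sketch below is a reconstruction, not a formula quoted from the report: it assumes the χ² for a prediction error equals (N - 3) times the squared difference between observed and predicted zUF, and solves for the N at which that χ² reaches the .05 critical value for 1 df. The function and constant names are illustrative.

```python
import math

CHI2_CRIT_1DF = 3.8415  # .05 critical value of chi-square with 1 df

def speed_prediction(distance_m):
    """Validity-distance regression fitted to runs shorter than 1500 m."""
    return 0.195 * math.log10(distance_m) + 0.0380

DISTANCE_PREDICTION = 0.8678  # average zUF for tests longer than 1.5 miles

def approx_critical_n(observed_z, predicted_z):
    """N at which the squared z difference would just reach significance.

    Assumes chi-square = (N - 3) * difference**2 (a reconstruction).
    """
    diff = observed_z - predicted_z
    return CHI2_CRIT_1DF / diff ** 2 + 3.0

# Average zUF for each boundary test, from Table 3.
boundary_tests = {1500: 0.7442, 1609: 0.7317, 2000: 1.0167, 2414: 1.0754}

results = {}
for meters, observed in boundary_tests.items():
    results[meters] = (
        approx_critical_n(observed, speed_prediction(meters)),
        approx_critical_n(observed, DISTANCE_PREDICTION),
    )
    print(meters, [round(n) for n in results[meters]])
```

Under this assumption the computed values fall within about two units of the tabled entries, which suggests the reconstruction is close to the procedure actually used, although small differences in rounding or in the exact form of Hoelter's index remain.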

The evaluations supported the proposed criteria. The
critical N for each proposed classification was 2 to 4 times
larger than that for the alternative classification. Larger
critical Ns indicate better prediction, so the proposed criteria
assigned each boundary test to the portion of the model that
provided better predictive accuracy. The critical Ns for the 2-km
and 1.5-mi tests were low, but larger for their proposed
assignment than for the alternative. In these cases, the fit of
the model was not as good as one would like, but the initial
classification was the lesser of two evils.

Discussion


This  review  tested  the  hypothesis  that  the  validity  of  run 
tests as indicators of VO2max increases continuously with
distance.  The  hypothesis  was  not  supported.  Validity  increased  up 
to  2  km,  then  remained  stable.  For  fixed-time  tests,  validity  was 
stable  for  runs  >  12  minutes.  The  average  validity  for  longer 
duration  runs  was  comparable  to  that  for  longer  distance  runs. 
Given this similarity, an endurance run can be defined as any run
> 2 km in distance or > 12 minutes in duration (see Footnote 4).
This definition
identifies  a  set  of  run  tests  that  all  possess  the  same  optimal 
validity  as  indicators  of  aerobic  capacity. 

The  recommended  definition  of  endurance  runs  involves 
longer  runs  than  some  common  testing  practices  (cf.,  Baumgartner 
&  Jackson,  1982).  However,  the  definition  is  consistent  with 
Disch  et  al.'s  (1975)  factor  analysis  of  performance  for  run 
tests  ranging  from  50  yards  to  2  miles.  Two  factors,  "speed"  and 
"distance,"  were  identified.  The  authors  originally  classified  a 
1-mile run as a distance test, but noted that "... shorter
distance tests of 1 mi or less tended to ... [load] on both
factors, whereas, the distances longer than 1 mi tended to be
unidimensional and loaded almost exclusively on the distance run
factor" (Disch et al., 1975, p. 169). The shortest distance
exceeding 1 mile in their study was 2.01 km (1.25 miles). Thus,
the proposed criteria are consistent with at least one prior study.

The  proposed  time  criterion  for  an  endurance  test  is 
approximate.  The  12-minute  run  is  an  endurance  test.  The  9-minute 
run  is  not.  The  optimum  criterion  might  fall  between  these  two 
values. However, there is too little data on fixed-time runs
between  9  and  12  minutes  to  set  the  duration  criterion  with 
greater  precision.  Clarification  of  this  issue  could  be  important 
because  time,  not  distance,  is  probably  the  key  factor  affecting 
run  test  validity.  For  example,  Sidney  and  Shephard's  (1977) 
elderly  men  and  women  produced  representative  validity 
coefficients  for  a  12-minute  run  despite  average  distances 
substantially  less  than  2  km  for  both  groups. 

Endurance  runs  are  more  valid  indicators  of  aerobic 
capacity  than  prior  reviews  suggest.  With  the  exception  of  Safrit 
et  al.'s  (1988)  work,  the  prior  reviews  suggest  validities  in  the 
range of .60 < r < .65 (Baumgartner & Jackson, 1982; Hatch &
Henry, 1972; Knapik, 1989). Safrit et al. (1988) reported a
higher value after correcting for measurement error, but their
raw correlations were in the range noted in other reviews. In
contrast, this review estimates the validity of endurance run
tests at r = .74. The inclusion of shorter runs in the prior
reviews is part of the reason for the difference. The analysis of
boundary cases showed that even a slight lowering of the criteria
adds run tests with substantially lower correlations, thereby
lowering the average.

The  proposed  endurance  criteria  do  not  mean  that  shorter 
runs  are  invalid.  The  random-effects  PW  model  estimates  of 
validity  for  shorter  runs  commonly  used  to  assess  aerobic 
capacity  ranged  from  r  =  .543  for  a  600-yard  run  to  r  =  .618  for 
a  1-mile  run.  Clearly,  these  tests  are  not  invalid  as  the 
correlations  are  substantially  greater  than  zero.  Shorter  tests 
will  be  useful  for  estimating  aerobic  capacity  when  validities  in 
this  range  are  acceptable  and  there  is  some  reason  to  avoid 
having  the  study  population  run  the  additional  distance  required 
to  meet  the  minimum  endurance  criterion.  However,  using  a  shorter 
run  does  imply  a  substantial  drop  in  validity  relative  to 
endurance  runs.  Also,  factor  loadings  from  Disch  et  al.'s  (1975) 
analysis  suggest  that  the  estimates  of  aerobic  capacity  will  be 
moderately  contaminated  by  differences  in  anaerobic  power. 

The endurance criteria are linked to the adoption of the PW
model as the best model of the relation between run distance and
validity. That decision was supported by the parsimony of the PW
model. Adopting the TxT model would increase the parameter count
11.5-fold (from 2 to 23 parameters) to improve predictions for 9%
(2 of 22) of the run tests representing ~3% of the total data.
The PW model also has clear connections to current theoretical
models of running performance. The increasing validity up to 2 km
can be explained by bioenergetics. Critical power, anaerobic
threshold, and related physiological concepts (Vandewalle,
Vautier, Kachouri, LeChevalier, & Monod, 1997; Walsh, 2000) can
account for the range of stable correlations. These concepts
predict that there is a critical velocity that can be maintained
for extended periods of time. Optimal running strategy is to
maintain a constant pace that is slightly faster so that anaerobic
resources are consumed evenly over the course of the run (Fukuba &
Whipp, 1999). Thus, each individual should have an approximately
constant pace for longer runs that is determined by aerobic
capacity and influenced only slightly by other energetic sources.
The implication is that all longer runs are primarily
manifestations of a single underlying physiological attribute.
From a statistical perspective, the tests are congeneric (Lord &
Novick, 1968) and should have an approximately constant
correlation to the criterion.

The  TxT  model  predictions  would  be  hard  to  explain 
physiologically.  Mechanisms  would  have  to  be  identified  that 
could  account  for  an  up-and-down  pattern  of  validity 
coefficients.  The  pattern  might  be  viewed  as  a  combination  of  a 
general  upward  trend  with  cyclical  variation  about  that  trend 
that  damped  to  very  small  fluctuations  for  longer  runs.  It  is  not 
obvious  what  physiological  constructs  could  be  invoked  to  explain 
this pattern.

Noting some limitations of this review puts the results in
proper  perspective.  The  conclusions  apply  with  greatest  certainty 
to  people  between  10  and  50  years  of  age.  Only  4  samples  in  this 
review  fell  outside  this  range.  The  risks  in  generalizing  beyond 
this  range  may  be  slight;  the  3  samples  of  older  individuals 
(Sidney  &  Shephard,  1977;  Tanaka,  Takeshima,  Kato,  Niihata,  & 
Ueda,  1990)  produced  correlations  comparable  to  those  for  younger 
people.  The  statistical  significance  estimates  must  be 
interpreted  cautiously.  Each  correlation  was  treated  as  an 
independent  observation  even  though  some  were  not.  More  complex 
computations  allowing  for  the  dependencies  would  yield  more 
precise  significance  estimates  (Becker  &  Schram,  1994;  Steiger, 
1978).  This  limitation  is  mitigated  by  the  fact  that  model 
selection  ultimately  focused  on  explanatory  precision,  not 
statistical  significance.  Also,  the  within-study  analysis  of 
correlations  provided  a  qualitative  test  of  the  model  that 
allowed  for  dependencies. 

The  most  important  limitation  of  this  review  is  that 
validity  generalization  has  not  been  addressed.  Validity  was 
stable  for  longer  runs  on  the  average,  but  there  was  substantial 
variation  around  that  average.  The  variation  may  indicate  that 
validity  is  lower  for  some  test  populations  than  for  others. 
Generalizability  is  critical  in  the  applied  use  of  run  tests 
(Baumgartner  &  Jackson,  1982;  Knapik,  1989;  Safrit  et  al.,  1988). 
This  review  provides  empirical  endurance  criteria  that  define  a 
population of run tests that share a common VO2max-running
performance  correlation.  The  null  hypothesis  in  generalizability 
analyses  is  that  different  populations  of  people  share  a  common 
population  correlation.  This  hypothesis  is  plausible  if  run  tests 
are  sampled  from  the  population  of  tests  defined  by  the  endurance 
criteria  developed  here.  The  present  findings,  therefore,  provide 
a  starting  point  for  proper  selection  of  correlations  suitable 
for  testing  generalizability  hypotheses.  The  present  findings 
also  identify  test  type  as  one  factor  that  affects  validity.  The 
average  validity  of  fixed-time  endurance  tests  (r  =  .797)  was 
significantly higher than that of fixed-distance tests (r = .718,
χ² = 30.65, 1 df, p < .001). A companion review (Vickers, in
preparation)  will  use  the  present  findings  as  a  point  of 
departure  for  a  detailed  exploration  of  generalizability  issues. 

The  applied  uses  of  the  findings  can  be  illustrated  by 
answering  the  two  questions  raised  in  the  introduction.  "How  long 
does  a  run  have  to  be  to  be  valid?"  If  r  =  .70  were  the  minimum 
acceptable validity coefficient, the minimum distance would be 2
km. Reducing the distance by as little as 0.4 km (i.e., to 1
mile) would incur a significant loss of validity (r = .63).
Regarding the second question, "If the current test is 2 km in
length, how much will be gained by increasing the distance?", the
model indicates nothing will be gained. However, there may be
some benefit to switching from a fixed-distance test to a
fixed-time test. The data also suggest an answer to a third
important applied question, "What is the highest validity that can
be achieved with run tests?" The best answer from this review is
r = .80. If this validity is not acceptable, some other method of
estimating aerobic capacity must be used.

Safrit et al. (1988) concluded that their review of the
VO2max-running performance literature provided a framework for
future studies. This review has elaborated on the line of study
initiated in that paper by developing a quantitative model of the
effect of run distance on validity. The model has two important
consequences. First, it provides an empirical definition of
endurance runs. Second, the model indicates that the validity of
run tests as indicators of aerobic capacity is higher than
suggested in previous reviews. The empirical definition of an
endurance test is a necessary starting point for validity
generalization analyses that are the subject of a companion
review (Vickers, in preparation). The immediate payoffs from the
model developed here include the possibility of making explicit
tradeoffs between distance and validity when appropriate.
Overall, the PW model should promote better understanding and
more effective use of run tests as aerobic capacity indicators.

Footnotes


1 Safrit et al. (1988) reported two values, r = .771 in the text
and r = .741 in Table 1. Whichever value is correct, the analysis
procedures corrected for measurement error. The weighted average
of the reliability data reported in the paper was rxx = .892 for
run tests and ryy = .753 for VO2max measurements. Inserting these
values into standard equations to correct for measurement error
(Hunter, Schmidt, & Jackson, 1982, p. 54-59), the correction
procedure can be reversed to yield the uncorrected correlations:

rxy = ρxy * SQRT(rxx * ryy)
    = .771 * SQRT(.892 * .753)
    = .771 * .820
    = .632

or

    = .608 (if ρxy = .741).
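
The arithmetic in this footnote can be checked with a few lines of code. This is a small sketch of the reversed correction for attenuation as the footnote describes it; the function and variable names are illustrative.

```python
import math

def uncorrected_r(corrected_r, rxx, ryy):
    """Reverse the correction for attenuation: corrected r times sqrt(rxx * ryy)."""
    return corrected_r * math.sqrt(rxx * ryy)

RXX_RUN = 0.892     # weighted average reliability of run tests
RYY_VO2MAX = 0.753  # weighted average reliability of VO2max measurements

attenuation = math.sqrt(RXX_RUN * RYY_VO2MAX)
r_from_text = uncorrected_r(0.771, RXX_RUN, RYY_VO2MAX)
r_from_table = uncorrected_r(0.741, RXX_RUN, RYY_VO2MAX)

print(round(attenuation, 3), round(r_from_text, 3), round(r_from_table, 3))
```

Carried at full precision, the second value comes out near .607 rather than .608. The footnote's figure follows when the attenuation factor is first rounded to .820.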

2 The referent for "validity" has been specified because test
validity is the appropriateness of the interpretation of test
scores (American Psychological Association, 1985). Most tests
have more than one interpretation and, therefore, more than one
validity. For example, a run test could be interpreted as a
performance indicator rather than an estimate of aerobic
capacity. This review examines run tests as estimates of aerobic
capacity or cardiorespiratory fitness. Unless otherwise
indicated, that reference is the sole meaning of validity when
the term is used in this paper.

3 The 2000-meter split point for the group classification may
appear too low when examining Figure 1. The graph flattens at a
point closer to 2400 meters. This appearance is misleading. LOESS
procedures compute the y value for an (x,y) pair by taking a
weighted average of observed y values over a range of x values.
The weights are larger for data points near the x value than for
more distant data points (Cleveland, 1979). The procedure is
designed to yield a smoother, robust representation of the data.
The resulting graph will be misleading if there are real
discontinuities in the data such as that embodied in the PW
model. The LOESS approach will yield an artifactually smooth
increase in the curve near the transition point. The curve will
be smooth and increasing in the transition region because it
averages increasing points below the transition with constant
points above. As the weights assigned to points in each domain
shift, the curve will increase smoothly. The stable value above
the transition point will be reached only when the weights
assigned to shorter distances all are near zero. This condition
will be satisfied only after the x value is well above the actual
transition point. Thus, a smooth curve from approximately 2.4
kilometers onward indicates a transition point somewhere below
this value. Other aspects of the analysis indicate that 2
kilometers is reasonable from this perspective.
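
The smoothing artifact described in this footnote can be shown with a toy calculation. The sketch below is a simplified stand-in for LOESS, not Cleveland's full procedure: it takes a tricube-weighted local average over a fixed window, with no local regression and no robustness iterations. The data are hypothetical and noise-free, rising linearly to a plateau at 2000 m, and the 600-m window half-width is an arbitrary choice for illustration.

```python
def tricube(u):
    """Tricube kernel: largest at u = 0, zero once |u| reaches 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def local_average(x0, xs, ys, half_width):
    """Weighted average of ys, with weights that decay with distance from x0."""
    weights = [tricube((x - x0) / half_width) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

TRANSITION_M = 2000  # true break point in the toy data
HALF_WIDTH_M = 600   # hypothetical smoothing window

xs = list(range(200, 5001, 100))
ys = [min(x, TRANSITION_M) / TRANSITION_M for x in xs]  # rises to 1.0, then flat

smoothed = {
    x0: local_average(x0, xs, ys, HALF_WIDTH_M)
    for x0 in (1700, 2000, 2300, 2600)
}
for x0, value in smoothed.items():
    print(x0, round(value, 3))
```

In this toy case the smoothed curve is still climbing at 2000 m and 2300 m and reaches the plateau only at 2600 m, once every point below the break has zero weight. The same logic places the real transition below the point where the curve in Figure 1 flattens.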

4 An apparent conflict between the time and distance definitions
of endurance runs should be noted. The average distance covered
in a 12-minute run test is 2.5 km, well above the 2.0-km
criterion. Regressing average time on average distance for
fixed-distance tests, the predicted average time for 2 km is 8:44.
This prediction is well below the 12-minute endurance criterion.
Note, however, that both predicted criteria refer to average
values. The more appropriate reference point might be the time or
distance required for the 95th percentile individual. That
reference point would be more appropriate given that all
individuals have to complete the time or distance in the standard
version of the tests. That reference point would be expected to
yield closer correspondence between the criteria.